Download the heart disease dataset `heart.csv` into the Resources folder (https://www.kaggle.com/fedesoriano/heart-failure-prediction) and do the following:
1. Load the heart disease dataset into a pandas DataFrame.
2. Remove outliers using the mean, median, and Z-score methods.
3. Convert text columns to numbers using label encoding and one-hot encoding.
4. Apply scaling.
5. Build a classification model using a support vector machine (SVM). Demonstrate the standalone model as well as a Bagging model, and include observations about the performance.
6. Now use a decision tree classifier, again standalone as well as with Bagging, and check whether you notice any difference in performance.
7. Comparing the performance of the SVM and the decision tree classifier, figure out where it makes the most sense to use bagging, and why.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
plotly.offline.init_notebook_mode()
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from scipy.stats import zscore
from scipy import stats
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report,mean_absolute_error,mean_squared_error,r2_score
df = pd.read_csv("csv/heart.csv")
df.head()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
df.shape
(918, 12)
# displaying a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int64
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int64
 4   Cholesterol     918 non-null    int64
 5   FastingBS       918 non-null    int64
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int64
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
# summary statistics of a DataFrame.
df.describe()
| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| count | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.066570 | 0.497414 |
| min | 28.000000 | 0.000000 | 0.000000 | 0.000000 | 60.000000 | -2.600000 | 0.000000 |
| 25% | 47.000000 | 120.000000 | 173.250000 | 0.000000 | 120.000000 | 0.000000 | 0.000000 |
| 50% | 54.000000 | 130.000000 | 223.000000 | 0.000000 | 138.000000 | 0.600000 | 1.000000 |
| 75% | 60.000000 | 140.000000 | 267.000000 | 0.000000 | 156.000000 | 1.500000 | 1.000000 |
| max | 77.000000 | 200.000000 | 603.000000 | 1.000000 | 202.000000 | 6.200000 | 1.000000 |
# counting the number of missing values (NaN) in each column of the DataFrame.
df.isnull().sum()
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
# plotting a correlation plot
px.imshow(df.corr(numeric_only=True), title="Correlation Plot of the Heart Failure Prediction Dataset")
# plotting the distribution of sex
sns.countplot(x = "Sex",data = df)
<Axes: xlabel='Sex', ylabel='count'>
# plotting the distribution of heartdisease
fig=px.histogram(df,
x="HeartDisease",
color="Sex",
hover_data=df.columns,
title="Distribution of Heart Diseases",
barmode="group")
fig.show()
# histogram of chest pain types
fig=px.histogram(df,
x="ChestPainType",
color="Sex",
hover_data=df.columns,
title="Types of Chest Pain"
)
fig.show()
Outlier Removal:
# plotting a histogram for each column of the dataset
plt.figure(figsize=(15, 10))
for i, col in enumerate(df.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()
Looking at the histograms, most of the numeric columns appear roughly normally distributed.
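That visual impression can be checked numerically with the sample skewness: values near 0 suggest a roughly symmetric, normal-like distribution, while large positive values indicate a long right tail. A minimal sketch on synthetic data shaped like two of the columns above (the Age-like and Oldpeak-like parameters are illustrative assumptions, not fitted to the dataset):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
normal_col = rng.normal(loc=54, scale=9, size=1000)  # Age-like, symmetric
skewed_col = rng.exponential(scale=1.0, size=1000)   # Oldpeak-like right tail

# Skewness near 0 -> roughly symmetric; large positive -> long right tail
print(round(float(skew(normal_col)), 2))
print(round(float(skew(skewed_col)), 2))
```

On the real DataFrame, pandas provides the same check directly via `df.skew(numeric_only=True)`.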
# box plots for the numeric columns of the dataset
plt.figure(figsize=(15, 10))
df_num = df.select_dtypes(include=['float64', 'int64'])
for i, col in enumerate(df_num.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.boxplot(df_num[col], color='lightgreen')
plt.tight_layout()
plt.show()
Outliers are clearly visible in the box plots.
Using Mean and Standard deviation:
# removing outliers using the mean and standard deviation (keep rows within mean ± 3·std)
df_without_outlier_mean = df.copy()
for column in df_without_outlier_mean.select_dtypes(include=[np.number]).columns:
    mean = df_without_outlier_mean[column].mean()
    std = df_without_outlier_mean[column].std()
    df_without_outlier_mean = df_without_outlier_mean[
        (df_without_outlier_mean[column] >= mean - 3 * std)
        & (df_without_outlier_mean[column] <= mean + 3 * std)
    ]
df_without_outlier_mean.shape
(899, 12)
Using Median and IQR:
# removing outliers using the median-based IQR rule (keep rows within [Q1 − 1.5·IQR, Q3 + 1.5·IQR])
df_without_outlier_median = df.copy()
for column in df_without_outlier_median.select_dtypes(include=[np.number]).columns:
    Q1 = df_without_outlier_median[column].quantile(0.25)
    Q3 = df_without_outlier_median[column].quantile(0.75)
    IQR = Q3 - Q1
    df_without_outlier_median = df_without_outlier_median[
        (df_without_outlier_median[column] >= Q1 - 1.5 * IQR)
        & (df_without_outlier_median[column] <= Q3 + 1.5 * IQR)
    ]
df_without_outlier_median.shape
(587, 12)
Using Z-score:
# performing outlier removal using z-scores.
df_without_outlier_zscore = df.copy()
z_scores = np.abs(stats.zscore(df_without_outlier_zscore.select_dtypes(include=['int64', 'float64'])))
df_without_outlier_zscore = df_without_outlier_zscore[(z_scores < 3).all(axis=1)]
df_without_outlier_zscore.shape
(899, 12)
Comparing the results, the median/IQR rule removes the most rows. However, since our data are not heavily skewed, we will not use the median-based filter for our model. I will use the z-score-filtered DataFrame (899 rows) for the remaining steps.
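Note that the mean ± 3·std filter and the |z| < 3 filter are essentially the same rule written two ways (up to the `ddof` convention for the standard deviation), which is why both leave 899 rows here. The sketch below checks this on synthetic data (not the heart dataset), matching scipy's `ddof` to pandas' sample standard deviation:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
data = pd.DataFrame({"x": rng.normal(0, 1, 500)})

mean, std = data["x"].mean(), data["x"].std()  # pandas std uses ddof=1
kept_mean = data[(data["x"] >= mean - 3 * std) & (data["x"] <= mean + 3 * std)]

z = np.abs(stats.zscore(data["x"], ddof=1))    # match ddof to pandas
kept_z = data[z < 3]

# Both filters keep exactly the same rows
print(kept_mean.index.equals(kept_z.index))
```

Beware that `stats.zscore` defaults to `ddof=0` (population standard deviation), so with the defaults the two filters can disagree on borderline points.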
Label and One Hot Encoding:
df_without_outlier_zscore.head()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
df_without_outlier_zscore.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 899 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             899 non-null    int64
 1   Sex             899 non-null    object
 2   ChestPainType   899 non-null    object
 3   RestingBP       899 non-null    int64
 4   Cholesterol     899 non-null    int64
 5   FastingBS       899 non-null    int64
 6   RestingECG      899 non-null    object
 7   MaxHR           899 non-null    int64
 8   ExerciseAngina  899 non-null    object
 9   Oldpeak         899 non-null    float64
 10  ST_Slope        899 non-null    object
 11  HeartDisease    899 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 91.3+ KB
# getting the categorical columns
categorical_col = df_without_outlier_zscore.select_dtypes(include=['object']).columns
categorical_col
Index(['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], dtype='object')
# finding the unique values in the categorical columns
unique_data = {column: df_without_outlier_zscore[column].unique() for column in categorical_col}
unique_data
{'Sex': array(['M', 'F'], dtype=object),
'ChestPainType': array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object),
'RestingECG': array(['Normal', 'ST', 'LVH'], dtype=object),
'ExerciseAngina': array(['N', 'Y'], dtype=object),
'ST_Slope': array(['Up', 'Flat', 'Down'], dtype=object)}
Looking closer at the data, columns with only two values, such as Sex and ExerciseAngina, can be label encoded, while columns with more categories are better handled with one-hot encoding.
# label encoding for the Sex and ExerciseAngina columns
le = LabelEncoder()
df_heart = df_without_outlier_zscore.copy()  # copy to avoid mutating the filtered frame
df_heart['Sex'] = le.fit_transform(df_heart['Sex'])
df_heart['ExerciseAngina'] = le.fit_transform(df_heart['ExerciseAngina'])
# One hot encoding for other columns
df_heart = pd.get_dummies(df_heart, columns=['ChestPainType', 'RestingECG', 'ST_Slope'], drop_first=True)
df_heart.head()
| | Age | Sex | RestingBP | Cholesterol | FastingBS | MaxHR | ExerciseAngina | Oldpeak | HeartDisease | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_Normal | RestingECG_ST | ST_Slope_Flat | ST_Slope_Up |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 140 | 289 | 0 | 172 | 0 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 49 | 0 | 160 | 180 | 0 | 156 | 0 | 1.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 37 | 1 | 130 | 283 | 0 | 98 | 0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 48 | 0 | 138 | 214 | 0 | 108 | 1 | 1.5 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 54 | 1 | 150 | 195 | 0 | 122 | 0 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
Applying Scaling:
# separating features and target, then scaling the features
X = df_heart.drop('HeartDisease', axis=1)
y = df_heart['HeartDisease']
scaler = StandardScaler()
X = scaler.fit_transform(X)
pd.DataFrame(X)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.428154 | 0.515943 | 0.465900 | 0.849636 | -0.550362 | 1.384320 | -0.822945 | -0.855469 | 2.063325 | -0.534905 | -0.229550 | 0.809702 | -0.489898 | -0.998888 | 1.134695 |
| 1 | -0.475855 | -1.938199 | 1.634714 | -0.168122 | -0.550362 | 0.752973 | -0.822945 | 0.137516 | -0.484655 | 1.869492 | -0.229550 | 0.809702 | -0.489898 | 1.001113 | -0.881294 |
| 2 | -1.745588 | 0.515943 | -0.118507 | 0.793612 | -0.550362 | -1.535661 | -0.822945 | -0.855469 | 2.063325 | -0.534905 | -0.229550 | -1.235023 | 2.041241 | -0.998888 | 1.134695 |
| 3 | -0.581666 | -1.938199 | 0.349019 | 0.149344 | -0.550362 | -1.141069 | 1.215148 | 0.634008 | -0.484655 | -0.534905 | -0.229550 | 0.809702 | -0.489898 | 1.001113 | -0.881294 |
| 4 | 0.053200 | 0.515943 | 1.050307 | -0.028064 | -0.550362 | -0.588640 | -0.822945 | -0.855469 | -0.484655 | 1.869492 | -0.229550 | 0.809702 | -0.489898 | -0.998888 | 1.134695 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 894 | -0.899099 | 0.515943 | -1.287320 | 0.616205 | -0.550362 | -0.194048 | -0.822945 | 0.336112 | -0.484655 | -0.534905 | 4.356349 | 0.809702 | -0.489898 | 1.001113 | -0.881294 |
| 895 | 1.534554 | 0.515943 | 0.699663 | -0.046738 | 1.816985 | 0.161085 | -0.822945 | 2.520678 | -0.484655 | -0.534905 | -0.229550 | 0.809702 | -0.489898 | 1.001113 | -0.881294 |
| 896 | 0.370633 | 0.515943 | -0.118507 | -0.625646 | -0.550362 | -0.864854 | 1.215148 | 0.336112 | -0.484655 | -0.534905 | -0.229550 | 0.809702 | -0.489898 | 1.001113 | -0.881294 |
| 897 | 0.370633 | -1.938199 | -0.118507 | 0.354763 | -0.550362 | 1.463238 | -0.822945 | -0.855469 | 2.063325 | -0.534905 | -0.229550 | -1.235023 | -0.489898 | 1.001113 | -0.881294 |
| 898 | -1.639776 | 0.515943 | 0.349019 | -0.214808 | -0.550362 | 1.423779 | -0.822945 | -0.855469 | -0.484655 | 1.869492 | -0.229550 | 0.809702 | -0.489898 | -0.998888 | 1.134695 |
899 rows × 15 columns
Train-test split:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=10)
Standalone SVM:
# performing classification using Support Vector Machines (SVM).
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_classification_report = classification_report(y_test, svm_predictions)
print('Accuracy of SVM:',svm_accuracy)
print('Classification Report of SVM:')
print(svm_classification_report)
Accuracy of SVM: 0.8611111111111112
Classification Report of SVM:
precision recall f1-score support
0 0.89 0.79 0.84 82
1 0.84 0.92 0.88 98
accuracy 0.86 180
macro avg 0.87 0.86 0.86 180
weighted avg 0.86 0.86 0.86 180
SVM with Bagging:
# performing bagging classification using Support Vector Machines (SVM).
bagging_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=10)
bagging_svm.fit(X_train, y_train)
bagging_svm_predict = bagging_svm.predict(X_test)
bagging_svm_accuracy = accuracy_score(y_test, bagging_svm_predict)
bagging_svm_classification_report = classification_report(y_test, bagging_svm_predict)
print('Accuracy of Bagging classifier with SVM:',bagging_svm_accuracy)
print('Classification Report of Bagging classifier with SVM:')
print(bagging_svm_classification_report)
Accuracy of Bagging classifier with SVM: 0.8666666666666667
Classification Report of Bagging classifier with SVM:
precision recall f1-score support
0 0.90 0.79 0.84 82
1 0.84 0.93 0.88 98
accuracy 0.87 180
macro avg 0.87 0.86 0.86 180
weighted avg 0.87 0.87 0.87 180
Using Decision Tree:
# training a decision tree classifier model.
decision_tree_model = DecisionTreeClassifier(random_state=10)
decision_tree_model.fit(X_train, y_train)
decision_predict = decision_tree_model.predict(X_test)
decision_accuracy = accuracy_score(y_test, decision_predict)
decision_classification_report = classification_report(y_test, decision_predict)
print('Accuracy of Decision Tree:',decision_accuracy)
print('Classification Report of Decision Tree:')
print(decision_classification_report)
Accuracy of Decision Tree: 0.7666666666666667
Classification Report of Decision Tree:
precision recall f1-score support
0 0.79 0.66 0.72 82
1 0.75 0.86 0.80 98
accuracy 0.77 180
macro avg 0.77 0.76 0.76 180
weighted avg 0.77 0.77 0.76 180
Decision Tree with Bagging:
# performing bagging classification using a decision tree as the base estimator.
bagging_decision = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=10)
bagging_decision.fit(X_train, y_train)
bagging_decision_predict = bagging_decision.predict(X_test)
bagging_decision_accuracy = accuracy_score(y_test, bagging_decision_predict)
bagging_decision_classification_report = classification_report(y_test, bagging_decision_predict)
print('Accuracy of Bagging Classifier with Decision Tree :',bagging_decision_accuracy)
print('Classification Report of Bagging Classifier with Decision Tree :')
print(bagging_decision_classification_report)
Accuracy of Bagging Classifier with Decision Tree : 0.8166666666666667
Classification Report of Bagging Classifier with Decision Tree :
precision recall f1-score support
0 0.81 0.78 0.80 82
1 0.82 0.85 0.83 98
accuracy 0.82 180
macro avg 0.82 0.81 0.81 180
weighted avg 0.82 0.82 0.82 180
Comparing the SVM and the SVM with Bagging classifier, the two have almost identical performance, with the bagged SVM performing slightly better.
We can also observe that the SVM classifier outperforms the Decision Tree classifier, with a higher accuracy (86.11% vs. 76.67%).
Bagging (bootstrap aggregating) is an ensemble approach in which multiple instances of the same base classifier are trained on bootstrap samples of the data and their predictions are combined by voting. According to the accuracy metrics, both the SVM and the Decision Tree classifier benefit from bagging, achieving higher accuracies than their standalone versions.
When to Use Bagging: Bagging is very effective in the following situations:
a. Decision Tree Classifier: Decision trees, especially when grown deep and unpruned, are prone to overfitting. By using bagging we can decrease overfitting and improve the model's generalisation performance. The Decision Tree with Bagging attained an accuracy of 81.67%, higher than the standalone Decision Tree's 76.67%, as seen in the accuracy metrics.
b. Data Variability: Bagging works well when the data are highly variable. By averaging predictions from many models it reduces variance, producing more stable and reliable results. This is especially useful when dealing with noisy data.
c. SVM Classifier: While SVM is typically more resistant to overfitting than decision trees, applying bagging can still improve its performance, particularly on complicated, overlapping, or noisy data. The accuracy metrics show that the SVM with Bagging obtained a slightly higher accuracy of 86.67%, compared with the standalone SVM's 86.11%.
In conclusion, bagging is especially valuable for decision tree classifiers since it reduces overfitting and improves generalisation. It can, however, give minor performance advantages for SVM classifiers, particularly when working with complicated or noisy data. Bagging improves model resilience and can be a useful ensemble approach for improving prediction performance in both classifiers.